ABSTRACT

Twitter is an extremely popular social networking platform.
Most Twitter users do not disclose their locations due to privacy
concerns. Although inferring the location of an individual Twitter
user has been extensively studied, an effective way to find the
majority of the users in a specific geographical area without
scanning the whole Twittersphere is still missing, and obtaining
these users would have both positive and negative implications.
In this paper, we propose LocInfer, a novel and lightweight system
to tackle this problem. LocInfer exploits the fact that user
communications in Twitter exhibit strong geographic locality, which
we validate through large-scale datasets. Based on experiments on
four representative metropolitan areas in the U.S., LocInfer can
discover on average 86.6% of the users with 73.2% accuracy in each
area by checking only a small set of candidate users. We also
present a countermeasure for users highly sensitive to location
privacy and show its efficacy through simulations.

1. INTRODUCTION 

Twitter is an extremely popular social networking tool for
communicating through short messages called tweets. As of
July 2014, Twitter has 255 million monthly active users and
500 million daily tweets. Due to such a massive user base and
popular usage, Twitter has been increasingly used in social
communications, information campaigns, public relations,
political campaigns, pandemic and crisis situations, marketing,
and many other public/private contexts.

User privacy is arguably a major concern about Twitter.
Specifically, user profiles and tweets may contain sensitive
information about life, work, health, hobbies, political opinions,
etc. Twitter currently offers little protection for user profiles
and tweets, which are visible to virtually anyone with or without
an account.¹ Consequently, many users employ pseudonyms instead of
real names in their profiles. In addition, Twitter users often hide
their home locations (locations for short hereafter), which are
permanent and static city-level regions (e.g., Philadelphia) where
most of their daily activities occur. Specifically, they may either
not indicate their locations or report very general locations
(e.g., state-level) in their profiles; they may not indicate their
locations in their tweets either. For example, less than 34% of
Twitter users explicitly specify their locations in their profiles
[1], only 16% of Twitter users indicate city-level locations, and
only 0.5% of tweets have a geo-tag.

¹ Although Twitter allows a user to make his information visible to
approved followers only, this privacy enhancement is rarely used in
practice.

There have been some efforts to infer a Twitter user’s hidden
location. Content-based methods [?] try to infer hidden locations
based on geographic hints such as city landmarks in tweets. For
example, a user who frequently mentions “Golden Gate Bridge” in his
tweets may be located in the Bay Area. In contrast, network-based
methods [?] leverage the fact that geographically close people tend
to form a connection or community in Online Social Networks (OSNs)
[12], so a user’s location can be inferred from those of his online
neighbors (or neighbors’ neighbors, etc.). Although they use
different estimation techniques, all these efforts [?] seek to
address the same question: how can we infer a Twitter user’s hidden
location from all his location-related tweets and/or OSN neighbors’
locations?

This paper targets a different and more challenging problem: is it
feasible to efficiently discover the majority of Twitter users in
any city-level metropolitan area (A) without collaborating with
Twitter? Since only 16% of Twitter users register city-level
locations [?], it is infeasible to tackle our problem by directly
checking users’ tweets and profiles. In addition, directly applying
any prior solution [?] would inevitably involve checking the
tweets, followers, and/or followees of every one of the 255 million
Twitter users, thus leading to a prohibitive cost.

An affirmative answer to our target problem above would have
significant positive and negative impacts. On the positive side,
finding the majority of the users in a specific area can not only
benefit many applications such as local event detection and
recommendation, business marketing, and emergency-alert
dissemination, but also offer a feasible way to sample Twitter to
facilitate research on geographically related information. On the
negative side, if an attacker can infer the majority of the Twitter
users in a specific area, he could easily combine the location
information with user tweets to better profile Twitter users who
may or may not use pseudonyms, thus breaching their privacy and
subjecting them to many identity-based attacks. Moreover, the
Twitter users with exposed locations are vulnerable to large-scale
location-based or geo-targeted spam campaigns [?].

In this paper, we propose LocInfer, a novel and lightweight
solution that addresses the above problem for the first time in the
literature. The design of LocInfer is driven by two conjectures.
First, a small but nontrivial fraction of users (15.9% on average
in our datasets) have specified a credible location in the target
area A in their personal profiles; each such user is referred to as
a seed user hereafter. Second, user communications in Twitter
exhibit strong geographic locality in the sense that the users in
the same area tend to interact with each other more often than with
those from outside. We confirm these two conjectures through
large-scale datasets involving four representative metropolitan
areas in the U.S. Built upon these conjectures, LocInfer
iteratively checks the immediate neighbors of the seed set, and the
users who have tight connections with the seed set become new seeds
and are added to the seed set. The final seed set contains the
majority of Twitter users in A with overwhelming probability.
LocInfer is highly efficient because only a small number of
candidate users need to be checked, in contrast to almost all the
Twitter users when the existing methods [?] are applied to our
problem.

Our contributions can be summarized as follows. 

• We motivate and formulate the problem of large-scale location
inference, which is challenging given that only a small fraction
of Twitter users have specified a credible city-level location
in their personal profiles.

• We design LocInfer, a novel and lightweight solution that
can uncover the majority of the Twitter users in a specific
metropolitan area.

• We conduct extensive experiments to evaluate LocInfer
using four large-scale datasets. Our results show that
LocInfer can successfully discover on average 86.6% of
the users with 73.2% accuracy.

• We propose a countermeasure against LocInfer for the
Twitter users worried about their location privacy and
evaluate its effectiveness via simulations.

The rest of this paper is organized as follows. Section 2
defines the problem. Section 3 validates our two conjectures
through four large-scale datasets. Section 4 details the LocInfer
design. Section 5 evaluates LocInfer and our countermeasure.
Section 6 surveys the related work. Section 7 concludes this
paper and presents some future work.

2. PROBLEM STATEMENT, TERMS AND NOTATION

We use a directed and weighted multigraph² to model the diverse
communications between Twitter users. In Twitter, people can follow
others without mutual consent; they can mention others in their own
tweets; they can also reply to or retweet others’ tweets. We
classify these communications into two categories: following and
interacting (retweeting, replying, and mentioning), denoted by
symbols $F$ and $I$, respectively. Such diverse communications are
modeled as a directed and weighted multigraph $G = (V, E)$, where
each vertex $v \in V$ represents a user. We refer to a directed
edge for the following type as a following edge and a directed edge
for the interacting type as an interacting edge. A following edge
$e^F_{ij} \in E$ is formed when user $i$ follows $j$; we call user
$i$ a follower of $j$ and $j$ a followee of $i$. In contrast, an
interacting edge $e^I_{ij} \in E$ is formed when user $i$
mentioned, replied to, or retweeted $j$ at least once; we call user
$i$ a responder of $j$ and $j$ an initiator of $i$. To model the
interaction strength, we define $w(e^I_{ij})$, the weight of edge
$e^I_{ij}$, as the total number of retweets, replies, and mentions
from user $i$ to $j$. For consistency, we also define the weight of
any following edge as one. We use $N^F_I(u)$, $N^F_O(u)$,
$N^I_I(u)$, and $N^I_O(u)$ to represent $u$’s one-hop followers,
followees, responders, and initiators, respectively. We also define
the one-hop neighbors of $u$ as
$N(u) = N^F_I(u) \cup N^F_O(u) \cup N^I_I(u) \cup N^I_O(u)$.

² In a multigraph, two vertices may be connected by more than one
edge.
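The graph model above can be sketched as a small in-memory data structure. This is only an illustration under our own assumptions: the class name `TwitterMultigraph`, its method names, and the toy user names are hypothetical and do not come from the paper; following edges are stored as sets (weight one) and interacting edges as dictionaries of interaction counts.

```python
from collections import defaultdict


class TwitterMultigraph:
    """Directed, weighted multigraph with following (F) and interacting (I) edges."""

    def __init__(self):
        self.followees = defaultdict(set)    # N_O^F(u): users that u follows
        self.followers = defaultdict(set)    # N_I^F(u): users that follow u
        self.initiators = defaultdict(dict)  # N_O^I(u): j -> weight of edge u -> j
        self.responders = defaultdict(dict)  # N_I^I(u): i -> weight of edge i -> u

    def add_follow(self, i, j):
        """User i follows user j (a following edge of weight one)."""
        self.followees[i].add(j)
        self.followers[j].add(i)

    def add_interaction(self, i, j, count=1):
        """User i retweeted, replied to, or mentioned user j `count` times."""
        self.initiators[i][j] = self.initiators[i].get(j, 0) + count
        self.responders[j][i] = self.responders[j].get(i, 0) + count

    def neighbors(self, u):
        """One-hop neighbors N(u): the union of the four neighbor sets."""
        return (self.followers[u] | self.followees[u]
                | set(self.responders[u]) | set(self.initiators[u]))


g = TwitterMultigraph()
g.add_follow("alice", "bob")                   # alice follows bob
g.add_interaction("alice", "carol", count=3)   # alice mentioned carol three times
g.add_interaction("dave", "alice")             # dave replied to alice once
print(sorted(g.neighbors("alice")))
```

Keeping both directions of every edge makes each of the four neighbor sets a constant-time lookup, at the price of storing each edge twice.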

Large-Scale Location Inference. Given a Twitter multigraph
$G = (V, E)$ and a target metropolitan area A, we aim to obtain
a target user list $U$ which contains the majority of Twitter users
in A without collaborating with Twitter.

Design goals. LocInfer is designed with the following goals.

• High coverage. The target user list $U$ should cover the
majority of Twitter users in A. If we denote the actual
Twitter users in A by $U^*$, the coverage can be computed
as $|U \cap U^*| / |U^*|$.

• High accuracy.³ The target users in $U$ should indeed be
located in A. The accuracy can be computed as
$|U \cap U^*| / |U|$.

• Efficiency. LocInfer should only involve checking a number of
Twitter users proportional to the population in A, in contrast
to existing methods [?], which all need to check all the
Twitter users. This efficiency requirement is particularly
important because without Twitter’s collaboration, the only
free way to obtain the users’ information is via third-party
APIs, which is time-consuming as Twitter has strict rate limits
on API invocations [?]. For example, an authenticated user can
only invoke the get-followers API 15 times per 15 minutes.
Hence if we invoke this API once for each of the 255 million
Twitter users, it would take a single authenticated user about
485 years to obtain all the Twitter users’ followers.
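The two quality metrics and the crawl-time estimate in the design goals can be checked with a few lines of Python. This is a minimal sketch: the function names and the toy user sets are ours, and the rate limit (15 calls per 15 minutes) is the one quoted in the text.

```python
def coverage(found, actual):
    """|U ∩ U*| / |U*|: share of the area's actual users that were found."""
    return len(found & actual) / len(actual)


def accuracy(found, actual):
    """|U ∩ U*| / |U|: share of the found users that are actually in the area."""
    return len(found & actual) / len(found)


def crawl_years(num_users, calls_per_window=15, window_minutes=15):
    """Years a single account needs to call a rate-limited API once per user."""
    minutes = num_users / calls_per_window * window_minutes
    return minutes / (60 * 24 * 365)


U = {"a", "b", "c", "x"}             # toy inferred user list
U_star = {"a", "b", "c", "d", "e"}   # toy ground truth
print(coverage(U, U_star), accuracy(U, U_star))
print(round(crawl_years(255_000_000)))
```

With one call per minute on average, 255 million calls take 255 million minutes, which is where the figure of roughly 485 years comes from.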

³ Note that coverage and accuracy correspond to the widely used
recall and precision, respectively. In this paper we use coverage
and accuracy to make the meaning more straightforward in the
context of uncovering users in an area.

3. CONJECTURES VALIDATION
Table 1: Seed users in four metropolitan areas in the U.S.

Area A             Population        #Twitter    #seed users             #seed users with
                   (rank in U.S.)    users       (over #Twitter users)   > 1 million followers
-----------------  ----------------  ----------  ----------------------  ---------------------
Tucson (TS)        996,544 (57th)    150,478     28,161 (18.65%)         0
Philadelphia (PI)  6,034,678 (7th)   911,236     144,033 (15.9%)         3
Chicago (CI)       9,522,434 (3rd)   1,437,888   318,632 (22.21%)        11
Los Angeles (LA)   16,400,000 (2nd)  2,476,400   300,148 (12.12%)        174


As we mentioned in Section 1, LocInfer is built upon two
important conjectures.

• Conjecture 1: A small but nontrivial fraction of users 
have specified a credible location in the target area A in 
their personal profiles. 

• Conjecture 2: User communications in Twitter exhibit
strong geographic locality in the sense that the users in
the same area tend to communicate with each other more often
than with those from outside.

In this section, we validate these two conjectures using four 
large-scale datasets. 

3.1 Data Collection 

We collect ground-truth Twitter users in different metropolitan
areas by checking the self-reported locations in their profiles,
a methodology that has been used to obtain the ground truth in [?].
Specifically, we use the Twitter geo-search API, which is designed
to return the recent or popular tweets in a specified geo-circle
defined by latitude, longitude, and radius [?]. For any area of
interest A, we convert it into a geo-circle for the geo-search API,
and we do not differentiate A and its corresponding geo-circle
hereafter. The geo-search API returns the tweets from three types
of users.

• Geo-tagged users: The users who recently published some 
tweets with a geo-tag in A. 

• Geo-profiled users: The users whose personal profiles
contain a location in A.

• Retweeting users: The users who recently retweeted some 
geo-tagged or geo-profiled users’ tweets in A. 

Among them, we only use the geo-profiled users to build our
datasets, because retweeting users are likely not in A, and
geo-tagged users may have just traveled to some places within the
geo-circle instead of living there. Moreover, since the result of
each geo-search API invocation corresponds to a random sample of
the active Twitter users, we keep invoking the geo-search API until
no significantly more geo-profiled users can be discovered.

The self-reported locations have been found reliable [?], but the
results from the geo-search API are still noisy for two reasons.
First, the location descriptions in many users’ profiles are
ambiguous and arbitrary. For example, people living in Los Angeles
may specify their locations as “South California”, or “Los
Angeles”, or “LA”, or just “CA.” Second, the geo-search API often
needs to convert a location description into a longitude-latitude
pair for comparison with the specified geo-circle. Such conversions
are often problematic and thus lead to wrong results. For example,
when we searched for users in the San Francisco Bay Area, the
geo-search API returned some users in other places or even with
nonsense descriptions such as “somewhere you’re not” and “wherever
you not.”

We thus refine the geo-profiled users as follows. For each user,
we further verify whether his/her location description indeed
contains a city name in A. For this purpose, we first obtain the
list of city names in A from the latest U.S. gazetteer data [?] and
then compare the location description with the list. If the
description matches a city name in the list, the user is considered
a ground-truth user in A.
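The refinement step can be illustrated with a simple string check. This is a sketch under assumptions of our own: the text only says that the location description is compared with a list of city names, so the case-insensitive whole-word matching, the function name `is_ground_truth`, and the short city list below are hypothetical choices, not the authors’ implementation.

```python
import re


def is_ground_truth(location, city_names):
    """Return True if the profile location mentions one of the area's city names."""
    text = location.lower()
    for city in city_names:
        # Whole-word match, so a short name such as "la" does not match "atlanta".
        if re.search(r"\b" + re.escape(city.lower()) + r"\b", text):
            return True
    return False


# Hypothetical excerpt of a city list for the Los Angeles area.
LA_CITIES = ["Los Angeles", "Long Beach", "Pasadena", "Santa Monica"]

profiles = {
    "u1": "Los Angeles, CA",
    "u2": "somewhere you're not",
    "u3": "pasadena",
    "u4": "CA",
}
seeds = {u for u, loc in profiles.items() if is_ground_truth(loc, LA_CITIES)}
print(sorted(seeds))
```

A state-only description such as “CA” is rejected here, which matches the goal of keeping only users with a credible city-level location.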


3.2 Datasets 

Using the above method, we collect user data in the four
metropolitan areas of Tucson (Arizona), Philadelphia, Chicago,
and Los Angeles. Our data collection ran from January to June
2014. Table 1 summarizes the four datasets. As we can see, the
four populations vary from one million in TS to 16 million in
LA, from the not-so-popular areas (e.g., TS) to popular areas
(e.g., LA). Note that all the metropolitan population information
is from the U.S. Census Bureau website.


3.3 Conjecture Validation 

To validate the first conjecture above, we estimate the number of
Twitter users for each area according to the eMarketer report
claiming that 15.1% of U.S. people are using Twitter as of Feb.
2014 [?]. As we can see from Table 1, the seed users range from
12.12% of the estimated Twitter users in LA to 22.21% in CI, with
an average ratio of 15.9%. This result is consistent with the
measurement in [?] and implies that we have crawled almost all the
users who have specified their city-level locations in these areas.
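The estimate can be reproduced from the population and seed counts in Table 1. The sketch below is ours and makes one assumption explicit: the 15.9% average appears to be the pooled ratio (total seed users over total estimated Twitter users) rather than the mean of the four per-area percentages.

```python
TWITTER_PENETRATION = 0.151  # share of U.S. people on Twitter, from the cited report

# (population, number of seed users) per area, copied from Table 1
areas = {
    "TS": (996_544, 28_161),
    "PI": (6_034_678, 144_033),
    "CI": (9_522_434, 318_632),
    "LA": (16_400_000, 300_148),
}

# Estimated Twitter users per area: population times the penetration rate.
estimated_users = {a: round(pop * TWITTER_PENETRATION) for a, (pop, _) in areas.items()}

# Pooled seed ratio over the four areas (our reading of the "average ratio").
total_seeds = sum(seeds for _, seeds in areas.values())
pooled_ratio = total_seeds / sum(estimated_users.values())

print(estimated_users)
print(f"{pooled_ratio:.1%}")
```

The estimated user counts agree with the “#Twitter users” column of Table 1.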

To validate the second conjecture above, we first define three
locality metrics. In particular, for the multigraph $G = (V, E)$
defined in Section 2, let $V'$ denote any subset of $V$. We define
follower locality $l_{\mathrm{follower}}(V')$, followee locality
$l_{\mathrm{followee}}(V')$, and initiator locality
$l_{\mathrm{initiator}}(V')$ as

$$
l_{\mathrm{follower}}(V') = \frac{|N^F_I(V') \cap V'|}{|N^F_I(V')|}, \quad
l_{\mathrm{followee}}(V') = \frac{|N^F_O(V') \cap V'|}{|N^F_O(V')|}, \quad
l_{\mathrm{initiator}}(V') = \frac{w(N^I_O(V') \cap V')}{w(N^I_O(V'))},
\tag{1}
$$

respectively, where $N^F_I(V')$, $N^F_O(V')$, and $N^I_O(V')$
represent the followers, followees, and initiators of $V'$,
respectively, and $w(\cdot)$ represents the total weight of the
corresponding interacting edges.

Table 2: Locality in each area. Each element is composed of
three values, representing the locality for the seed users in
each area, the first type of random user set, and the second
type of random user set, respectively.

A    l_follower (%)      l_followee (%)     l_initiator (%)
---  ------------------  -----------------  -----------------
TS   8.1 | 0.08 | 0.04   9.2 | 0.3 | 0.1    12.8 | 0.5 | 0.2
PI   4.9 | 0.4 | 0.2     8.4 | 1.5 | 0.6    14.9 | 2.7 | 1.2
CI   6.9 | 1.0 | 0.5     10.3 | 3.3 | 1.3   16.9 | 5.2 | 2.6
LA   1.5 | 0.9 | 0.5     8.4 | 3.1 | 1.2    17.0 | 5.0 | 2.5

Table 3: Breaking down the initiator locality by three types
of interactions.

A    Replying (%)   Retweeting (%)   Mentioning (%)
---  -------------  ---------------  ---------------
TS   14.41          10.61            13.37
PI   14.05          12.95            16.99
CI   17.46          14.23            18.50
LA   15.43          15.44            19.05

We let $V'$ equal the seed users in each area and then compute
the corresponding locality. To do so, we crawl all the followers
and followees of each seed user, and we also crawl the latest
600 tweets of each seed user to extract their initiators. For
comparison, we build two types of random user sets. First, we
merge the four seed sets into a single set from which we randomly
select the same number of users as the seeds in each area. Second,
we randomly select from the whole Twitter system the same number
of users as the seeds in each area and compute their corresponding
locality. We build 10 different user sets for each type of random
user set.
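The three locality metrics can be computed directly from crawled neighbor lists. The following is a minimal sketch with hypothetical function names and toy data; it assumes that the weight $w(\cdot)$ of a set of initiators is the sum of the interaction counts on the edges from users in $V'$ to those initiators. The same functions would be applied unchanged to the seed set and to each random user set.

```python
def set_locality(users, neighbor_map):
    """Follower or followee locality of a user set V'.

    neighbor_map[u] is the follower set (or the followee set) of user u.
    """
    neighbors = set().union(*(neighbor_map.get(u, set()) for u in users))
    return len(neighbors & users) / len(neighbors) if neighbors else 0.0


def initiator_locality(users, interactions):
    """Weight of interacting edges that stay inside V' over the total weight.

    interactions[u] maps each initiator of u to the number of interactions.
    """
    total = inside = 0
    for u in users:
        for target, weight in interactions.get(u, {}).items():
            total += weight
            if target in users:
                inside += weight
    return inside / total if total else 0.0


seeds = {"a", "b", "c"}
followees = {"a": {"b", "x"}, "b": {"a", "c"}, "c": {"y", "z"}}
interactions = {"a": {"b": 3, "x": 1}, "b": {"a": 2}, "c": {"y": 4}}

print(set_locality(seeds, followees))           # 3 of the 6 distinct followees are seeds
print(initiator_locality(seeds, interactions))  # 5 of the 10 interactions stay inside
```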

Table 2 shows the results of the locality analysis. We can see
that the three locality values of the seed users in each area are
always much higher than those of the random user sets. This
result confirms our conjecture that physical proximity plays a
big role in enabling online communications in Twitter. Moreover,
Table 2 shows a higher percentage of a user’s initiators in
the same area than that of his/her followees. This is not
surprising because a user may follow many people in different
areas but often interact with only a few selected followees. In
addition, we can see that the followee locality is much higher
than the follower locality except in TS. The reason can be
explained as follows. A celebrity user such as @rihanna can easily
attract millions of followers from around the world, but she may
follow relatively few people. So we can expect a higher percentage
of her followees in the same area (Los Angeles) than that of her
followers. Since each of the areas except Tucson has a large
number of celebrity users, the followee locality is much higher
than the follower locality. In contrast, Tucson is a much smaller
area with relatively few celebrity users, so we can expect similar
followee and follower locality. Table 3 also shows that mentions,
replies, and retweets contribute similarly to the initiator
locality of each seed set, so we do not distinguish them in the
LocInfer design.

Figure 1: The average local neighbors of the seed users.

Finally, although interacting communications (replies, mentions,
and retweets) show much stronger locality than following
communications, Fig. 1 shows that the corresponding interacting
edges (i.e., initiators) are much fewer than the following
edges (i.e., followers and followees), meaning that people
interact less than they follow others. Moreover, we also observe
that people usually interact with those whom they follow or who
follow them. In particular, let us define the overlap between
$N^I_O(V')$ and $N^F_I(V') \cup N^F_O(V')$ for each area as

$$
\frac{|N^I_O(V') \cap (N^F_I(V') \cup N^F_O(V'))|}{|N^I_O(V')|}.
$$

Our analysis shows that the average overlap for the four areas
is 96.2%.
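The overlap measure can be sketched in the same style. The function name and the toy data are hypothetical; the code only illustrates the set expression above, i.e., the share of a user set’s initiators that are also among its followers or followees.

```python
def interaction_overlap(users, initiators, followers, followees):
    """Share of the initiators of V' that also follow, or are followed by, V'."""
    init = set().union(*(initiators.get(u, set()) for u in users))
    follow = set().union(
        *(followers.get(u, set()) | followees.get(u, set()) for u in users)
    )
    return len(init & follow) / len(init) if init else 0.0


users = {"a", "b"}
initiators = {"a": {"b", "x"}, "b": {"y"}}
followers = {"a": {"x"}}
followees = {"a": {"b"}, "b": {"a"}}

# Initiators are {b, x, y}; b and x are also followers or followees.
overlap = interaction_overlap(users, initiators, followers, followees)
print(f"{overlap:.3f}")
```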

4. LOCINFER 

As stated before, our goal is to uncover the majority of Twitter
users in an area A. A naive solution is to use existing location
inference methods [?] for estimating the location of every Twitter
user and then select the ones in A. However, these methods are
impractical for our problem. In particular, they would require
crawling the followers, the followees, and many tweets for all the
255 million active Twitter users. Since Twitter has strict rate
limits on data crawling [?], the crawling process for these methods
would be prohibitively time-consuming. In addition, the
network-based methods [?] need to store and process the edges of
the whole Twitter graph, thus leading to prohibitive storage and
processing costs.

Now we present LocInfer, an efficient and effective three-step
system to identify the majority of users in A. As mentioned
earlier, LocInfer is built upon two conjectures which have been
experimentally validated in Section 3. First, we can find a
nontrivial number (15.9% in our datasets) of users who have
explicitly indicated a location in A through their personal
profiles. These users are referred to as seed users (or seeds) in
A and denoted by S. Second, user communications in Twitter exhibit
strong geographic locality in the sense that users in the same
area tend to have more intensive communications with each other in
Twitter than with those from outside. Based on these two
conjectures, LocInfer first builds a seed set S (step 1 in Section
4.1) and then checks the one-hop neighbors of the seed set S,
which constitute a candidate set denoted by C (step 2 in Section
4.2). Because of the nontrivial seed set S and the strong
geographic locality, C will cover the majority of the users in A,
but also include many users outside. Hence LocInfer chooses the
candidate users who have tight connections with S as new seeds and
adds them to S, and this process continues until some termination
conditions are met (step 3 in Section 4.3). The final seed set S
contains the majority of Twitter users in A with overwhelming
probability. LocInfer is highly efficient because it only checks a
much smaller set of Twitter users, in contrast to all the Twitter
users if existing methods [?] are applied.

We notice that many community structures other than geographic
communities (e.g., a group of people in different locations with
common interests or past experience, like classmates and
colleagues) may also yield strong inter-connections. Hence LocInfer
may include some users outside A in the candidate set C. However,
the impact of such outside users is minimal because LocInfer only
selects the users in the target area A as the seeds S and later
only chooses the target users who have strong communications
with S.


4.1 Step 1: Finding Seed Users 

The first step in LocInfer is to extract the seed users who are
most certainly in A. To that end, we use the same method as
in Section 3.1: we invoke the Twitter geo-search API to obtain
the geo-profiled users and then refine them by checking their
location descriptions to build the seed set S in A.

It is possible that some people specify fake home locations in
their profiles, and it is infeasible to completely pinpoint and
exclude such users. Fortunately, such self-reported locations have
been verified to be very reliable [?] and have been used as the
ground truth in [?]. Meanwhile, we may accidentally exclude some
users who are indeed in A, which is quite acceptable given our
focus on obtaining a reliable seed set in this step. We admit that
more advanced methods can be used for the seed searching and
refinement, which are left for future work.


4.2 Step 2: Finding Candidate Users 

Based on the nontrivial number of seed users, the second
step is to construct a candidate-user set C from the one-hop
neighbors of S that potentially covers the majority of Twitter
users in A but is also much smaller than the set of all Twitter
users. Below we first discuss how we decide the candidate
users in C and then theoretically analyze the coverage of C.


4.2.1 Choosing C 

We first build the candidate set C from the one-hop neighbors
of S. The underlying intuition is based on the two conjectures
validated in Section 3. Specifically, the second conjecture
indicates that the users in the same geographic area tend
to communicate more densely among themselves than with those
from outside. On the one hand, if a user has very limited
communications with all the seed users in S, which account for
about 15.9% of the total users in A, with high probability he/she
is not in A; on the other hand, a user that is indeed in A is very
likely to have direct communication with some seeds. We therefore
choose to build the candidate set C from the one-hop neighbors
of S, denoted as $N(S)$.

Two details need further consideration. As defined in Section 2,
each Twitter user has four kinds of neighbors in $G = (V, E)$:
followers, followees, initiators, and responders. Which
neighbors should we choose for each seed user? We observe
from Fig. 1 that many Twitter users may follow a large number
of other users, but they tend to subsequently interact with
relatively few followees. Since people usually interact with
those whom they follow or who follow them (with 96.2% overlap on
average, as stated in Section 3.3) and C should cover as many
users in A as possible, we consider all the followers and
followees of each seed user in this step. Moreover, since each
user in Twitter can follow arbitrary users without prior consent,
the unidirectional following relationship is not a reliable
indicator of geographic closeness. To deal with this issue, we
select as candidates only those followers and followees of the
seed users that have at least $t$ followees and $t$ followers in
S, where $t$ is a system threshold.

More formally speaking, for each user
$u \in N^F_O(S) \cup N^F_I(S)$, we compute
$n^F_I(u) = |N^F_I(u) \cap S|$ and $n^F_O(u) = |N^F_O(u) \cap S|$.
If both $n^F_I(u)$ and $n^F_O(u)$ are no less than $t$, user $u$
is added to the candidate set C and ignored otherwise.

Alg. 1 implements the overall process. Specifically, we first
create a followee counter and a follower counter for each user
in $N^F_O(S) \cup N^F_I(S)$. Then we traverse the followee and
follower lists of each seed and increase the corresponding followee
and follower counters. If both the followee and follower counters
are at least $t$, we choose the user $u$ as a candidate.
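The counting procedure of Alg. 1 can be sketched in Python as follows. This is an illustration rather than the authors’ code: the function and variable names are ours, the crawled lists are assumed to be available as dictionaries of sets, and we additionally assume that existing seeds are not re-added as candidates. Only the followee and follower lists of the seeds are traversed, so the cost depends on the seeds’ neighborhoods and not on the size of the whole Twitter graph.

```python
from collections import Counter


def candidate_set(seeds, followees, followers, t):
    """Users with at least t followers and at least t followees in the seed set.

    followees[u] / followers[u] are the crawled followee / follower sets of seed u.
    """
    seed_followers = Counter()  # for user v: number of seeds that follow v
    seed_followees = Counter()  # for user v: number of seeds that v follows
    for u in seeds:
        for v in followees.get(u, ()):  # seed u follows v
            seed_followers[v] += 1
        for v in followers.get(u, ()):  # v follows seed u
            seed_followees[v] += 1
    return {
        v for v in seed_followers
        if v not in seeds  # assumption: seeds themselves are not candidates
        and seed_followers[v] >= t
        and seed_followees[v] >= t
    }


seeds = {"s1", "s2"}
followees = {"s1": {"x", "y"}, "s2": {"x", "s1"}}
followers = {"s1": {"x", "y", "s2"}, "s2": {"x", "z"}}

print(sorted(candidate_set(seeds, followees, followers, t=1)))
print(sorted(candidate_set(seeds, followees, followers, t=2)))
```

In the toy data, user x is followed by both seeds and follows both of them, so it survives for t = 2, while y is connected to only one seed in each direction and is kept only for t = 1.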

Algorithm 1: Obtain the candidate set C by only checking
the followee and follower lists of the seed set S.

  input : $S$, $N^F_O(S)$, $N^F_I(S)$, $t$
  output: the candidate set $C$
  $C \leftarrow \emptyset$; $C_O[v] \leftarrow 0, \forall v \in N^F_O(S)$;
  $C_I[v] \leftarrow 0, \forall v \in N^F_I(S)$;
  for $u \in S$ do
      $C_O[v]$++, $\forall v \in N^F_O(u)$; $C_I[v]$++, $\forall v \in N^F_I(u)$;
  end
  for $v \in N^F_O(S)$ do
      if $C_O[v] \ge t$ and $v \in N^F_I(S)$ and $C_I[v] \ge t$ then
          $C \leftarrow C + \{v\}$;
      end
  end
  return $C$.

4.2.2 Coverage of C

The number of candidate users (i.e., $|C|$) is determined by
both the number of seed users (i.e., $|S|$) and the system
parameter $t$. A natural question is whether C can cover the
majority of users in the target area A. This is important because
the new seeds (or equivalently the target users) will be found
only from C.

To analyze the coverage of C, we first define the following
terms and notation. We call users $i$ and $j$ mutual followers if
they follow each other. Let $G_A = (V_A, E_A)$ be a subgraph of
the Twitter multigraph $G = (V, E)$, where $V_A \subseteq V$ is
the set of the Twitter users in the target area A, and
$E_A \subseteq E$ is the set of the directed following edges among
the users in $V_A$. Consider a seed set $S \subseteq V_A$ with
$s = |S| = \alpha |V_A|$ users, where $\alpha \in (0, 1]$. Let
$N^t(S)$ denote the set of the followers and followees of S, each
having at least $t$ followers and $t$ followees in S, where $t$ is
the system threshold stated before. The coverage ratio of C is
defined as $r(t) = |N^t(S) \cup S| / |V_A|$. We have the following
theoretical results about the coverage of C given $|S|$ and $t$.

Theorem 1. Assume that each user in $V_A$ has on average $d_m$
mutual followers in $V_A$. When $|V_A|$ is large enough, the
expected coverage ratio satisfies

$$
r(t) \ge 1 - (1 - \alpha)\, e^{-\alpha d_m} \sum_{i=0}^{t-1} \frac{(\alpha d_m)^i}{i!}.
$$

Proof. We first construct an undirected graph $G' = (V_A, E')$,
where an edge $e'_{ij} \in E'$ is formed if and only if users $i$
and $j$ are mutual followers. Let $N'^t(S)$ be the set of neighbors
of S in $G'$, each having at least $t$ neighbors in S. We proceed
to define the coverage of S in $G'$ as
$r'(t) = |N'^t(S) \cup S| / |V_A|$.

We now compute $r'(t)$. Since each user has on average $d_m$
edges in $E'$, the probability of one user connecting to any other
user is $p = d_m / |V_A|$. Moreover, since there are
$s = \alpha |V_A|$ seed users, the probability of any non-seed node
$u$ connecting to fewer than $t$ seed users in $G'$ is given by

$$
\hat{p} = \sum_{i=0}^{t-1} \binom{s}{i} p^i (1 - p)^{s - i}.
$$

When the number of users in $V_A$ is large, we have

$$
\lim_{|V_A| \to +\infty} (1 - p)^s
= \lim_{|V_A| \to +\infty} \left(1 - \frac{d_m}{|V_A|}\right)^{\alpha |V_A|}
= e^{-\alpha d_m},
$$

and, since $sp = \alpha d_m$ stays constant, the binomial terms
converge to Poisson terms, so that
$\hat{p} \to e^{-\alpha d_m} \sum_{i=0}^{t-1} (\alpha d_m)^i / i!$.

Since there are $|V_A| - s$ non-seed users, the expected number
of non-seed users connecting to $t$ or more seeds in S can be
computed as
$(|V_A| - s)(1 - \hat{p}) = |V_A| (1 - \alpha)(1 - \hat{p})$. When
$|V_A|$ is large, we have

$$
r'(t) = \frac{|N'^t(S) \cup S|}{|V_A|}
= \frac{|V_A| (1 - \alpha)(1 - \hat{p}) + \alpha |V_A|}{|V_A|}
= 1 - (1 - \alpha)\hat{p}
= 1 - (1 - \alpha)\, e^{-\alpha d_m} \sum_{i=0}^{t-1} \frac{(\alpha d_m)^i}{i!}.
$$

Since each edge in $E'$ corresponds to two directed edges
in $E$, all the users in $N'^t(S)$ must belong to $N^t(S)$. On the
other hand, a user in $N^t(S)$ may not appear in $N'^t(S)$. For
example, consider a user who has exactly $t$ followers and $t$
followees in S in graph $G$, where none of his followers and
followees are the same. Then this user is an isolated vertex in
$G'$, and he is certainly in $N^t(S)$ but not in $N'^t(S)$.
Therefore, we have $N'^t(S) \subseteq N^t(S)$ and
$r'(t) \le r(t)$, and the theorem is proved. □

Corollary 1. $r(t = 1) \ge 1 - e^{-\alpha d_m} (1 - \alpha)$.

Corollary 2.
$r(t = 2) \ge 1 - e^{-\alpha d_m} (1 - \alpha)(1 + \alpha d_m)$.

Since $|V_A|$ is often large in practice, Theorem 1 indicates
that the coverage ratio $r(t)$ approaches 1 when $\alpha d_m$ is
large enough. Moreover, the choice of $t$ involves a tradeoff
between the crawling cost and the coverage. Specifically, the
larger $t$ is, the fewer candidates there are in C, the smaller
the crawling cost, and the more likely we are to miss some users
in A (i.e., the lower the coverage), and vice versa. The size of S
also affects the choice of $t$. On the one hand, if S constitutes
a relatively large portion of the users in A (say,
$\alpha = 30\%$), it may be safe to use a larger $t$ because many
users in A are likely to have more followees and followers in S.
On the other hand, if S constitutes a relatively small portion of
the users in A (say, $\alpha = 10\%$), it is safer to use a
smaller $t$ to avoid excluding too many users in A.

Here we illustrate how many seeds are needed to achieve
a nearly 100% coverage. Assume that each user in A has on
average 15 mutual followers (i.e., $d_m = 15$). According to
Corollaries 1 and 2, when $t = 1$, 20% of the users as seeds can
cover 96.02% of the target users in A, and when $t = 2$, 20%
and 30% of the users as seeds can cover 84.07% and 95.72% of
the users in A, respectively. Similarly, if $d_m = 30$, only 10%
and 15% of the users as seeds can cover 95.52% and 95.99%
of the users for $t = 1$ and $t = 2$, respectively. These results
indicate that when each user has sufficient mutual followers in
A, the followers and followees of a small number of seeds can
cover the majority of the target users in A.

4.3 Step 3: Finding Target Users U

Although the candidate set C covers nearly all the users in A
for proper $t$, it may contain many users not in A who
nevertheless have at least $t$ followees and also $t$ followers in
the seed set S. For example, social butterflies [?] or social
capitalists [?] have been reported to automatically follow back
whoever follows them, and users may also follow each other due to
reciprocity [16, 17]. We thus design the next step to identify the
target user set U in A from C using both the following and
interacting connections among the users.

Our key observation, as stated, is that each target user is very
likely to demonstrate significant locality with the seed user set
S. In other words, we expect that the target users form a strong
local community with the seed users. From the initial seed set
S, we iteratively check the candidate users in C, and the
candidate who has the highest locality value with the seeds becomes
a new seed and is added to S. The process iterates until certain
conditions are met.

How should we compute the locality of Twitter users with 
diverse communications? Inspired by the Eq. Q, we consider 
three types of locality for any candidate user $u \in C$: follower
locality $l_{\text{follower}}(u)$, followee locality
$l_{\text{followee}}(u)$, and initiator locality
$l_{\text{initiator}}(u)$, which are computed as

$$l_{\text{follower}}(u) = \frac{|N_f(u) \cap S|}{|N_f(u)|}, \quad
l_{\text{followee}}(u) = \frac{|N_g(u) \cap S|}{|N_g(u)|}, \quad \text{and} \quad
l_{\text{initiator}}(u) = \frac{w(N_q(u) \cap S)}{w(N_q(u))}, \tag{3}$$

where $N_f(u)$, $N_g(u)$, and $N_q(u)$ are u’s followers, followees,
and initiators, respectively, and $w(\cdot)$ denotes the total weight of
the corresponding interacting edges.

We also consider two methods to integrate the three types of
locality. First, we choose the maximum one among them as u’s
locality, i.e.,

$$l(u) = \max\{l_{\text{follower}}(u),\, l_{\text{followee}}(u),\, l_{\text{initiator}}(u)\}. \tag{4}$$

Second, their weighted combination is used as the locality of
u, i.e.,

$$l(u) = c_1 l_{\text{follower}}(u) + c_2 l_{\text{followee}}(u) + c_3 l_{\text{initiator}}(u), \tag{5}$$

where $0 < c_1, c_2, c_3 < 1$ and $c_1 + c_2 + c_3 = 1$. In this paper,
we choose each of them to be 1/3 for simplicity and leave other
possible assignments as future work.
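
The locality definitions can be sketched in a few lines of Python. This is a minimal illustration rather than the system's implementation: the function names, the use of Python sets for neighbor lists, the `weight` dictionary for interacting edges, and the choice to return 0 for a user with no neighbors are all assumptions made here.

```python
def locality(neighbors, seeds, weight=None):
    """Share of a user's neighbors (optionally edge-weighted) that are seeds."""
    if weight is None:
        # Unweighted case, as for follower and followee locality.
        weight = {v: 1.0 for v in neighbors}
    total = sum(weight[v] for v in neighbors)
    if total == 0:
        return 0.0  # assumption of this sketch: no neighbors means locality 0
    return sum(weight[v] for v in neighbors if v in seeds) / total


def combine_max(l_follower, l_followee, l_initiator):
    """Integrated locality as the maximum of the three values."""
    return max(l_follower, l_followee, l_initiator)


def combine_weighted(l_follower, l_followee, l_initiator, c=(1 / 3, 1 / 3, 1 / 3)):
    """Integrated locality as a weighted sum; the weights default to 1/3 each."""
    return c[0] * l_follower + c[1] * l_followee + c[2] * l_initiator


if __name__ == "__main__":
    seeds = {"a", "b", "c"}
    l_fr = locality({"a", "b", "x", "y"}, seeds)                # followers
    l_fe = locality({"a", "x", "y", "z"}, seeds)                # followees
    l_in = locality({"a", "x"}, seeds, {"a": 3.0, "x": 1.0})    # initiators
    print(l_fr, l_fe, l_in, combine_max(l_fr, l_fe, l_in))
```

Passing edge weights only for the initiator case mirrors the definitions, where follower and followee locality count users while initiator locality sums interaction weights.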

Finally, we iteratively find the target users based on one of
the five types of locality with regard to the seed set S. In each
iteration, we compute the locality for each candidate $u \in C$
according to Eq. (3), Eq. (4), or Eq. (5). The candidate with the
highest locality is removed from C and added to S as a new
seed, as this user contributes most to the tightness of the
community around S. In addition, the follower, followee, and/or
initiator locality values of the remaining candidates in C need to
be updated in every iteration. Here we use the followee
locality to illustrate the updating operation. Let
$l^{m}_{\text{followee}}(u)$ denote the followee locality for candidate u
in iteration $m \ge 0$, where $l^{0}_{\text{followee}}(u)$ is computed
using the initial seeds in S. Assuming that $u^*$ has been chosen as a
new seed in iteration m, we update the followee locality for
candidate u as

$$l^{m+1}_{\text{followee}}(u) =
\begin{cases}
l^{m}_{\text{followee}}(u) + 1/|N_g(u)| & \text{if } u^* \in N_g(u), \\
l^{m}_{\text{followee}}(u) & \text{o.w.}
\end{cases} \tag{6}$$
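
As a small worked illustration of this update, with made-up numbers that are not taken from the datasets, suppose a candidate u follows four users, exactly one of whom is currently a seed, and the newly chosen seed $u^*$ is another of u's followees. Then

```latex
l^{m}_{\text{followee}}(u) = \frac{1}{4},
\qquad
l^{m+1}_{\text{followee}}(u) = \frac{1}{4} + \frac{1}{4} = \frac{2}{4} = \frac{1}{2},
```

which equals the value obtained by recomputing the followee locality from scratch against the enlarged seed set. A candidate that does not follow $u^*$ keeps its value of 1/4 unchanged.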

Follower and initiator locality can be updated similarly, and we
may need to update the overall locality according to Eq. (4) or
Eq. (5). The iteration terminates when the seed set S contains a
desired number of users in A, denoted by $\tau_A$. Then the sought
target users correspond to all the users in S. The complete
process is summarized in Alg. 2, which is implemented using
a max-priority queue.

The termination threshold $\tau_A$ can be chosen in two ways.
First, we can set $\tau_A$ as the estimated number of Twitter users in
A, e.g., about 15.1% of the population in A if A is in the U.S.
Second, $\tau_A$ can be chosen according to the level of confidence
we desire. In particular, our algorithm essentially ranks all the
candidate users according to our confidence about their locations
in A. The later a candidate user is added to U, the lower
confidence we have that he is indeed in A. Therefore, if we
want to obtain a set of target users in A with high confidence, a
small $\tau_A$ should be used; if we want to cover more users in A,
a larger $\tau_A$ is suitable.


Algorithm 2: Identify target users in A from C.

input : S, C, $\tau_A$
output: U, i.e., the users in A

 1  $U \leftarrow S$;
 2  Compute $l(u)$, $\forall u \in C$, according to Eq. (3), (4), or (5);
 3  $Q \leftarrow \emptyset$;
 4  for $u \in C$ do
 5      INSERT($Q$, $u$);
 6  while $|U| < \tau_A$ do
 7      $u^* \leftarrow$ EXTRACT-MAX($Q$);
 8      $U \leftarrow U + \{u^*\}$, $S \leftarrow S + \{u^*\}$;
 9      for $u \in N_f(u^*)$ do
10          INCREASE-KEY($Q$, $u$, $l(u) + 1/|N_g(u)|$);
11  return $U$.
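
The greedy expansion of Alg. 2 can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' code: it uses followee locality only, the input format (a dictionary `followees` mapping each candidate to the set of users it follows) and the function name are hypothetical, and because Python's `heapq` offers no INCREASE-KEY operation, the sketch pushes a fresh entry and skips stale ones when they are popped.

```python
import heapq


def find_target_users(seeds, candidates, followees, tau):
    """Promote candidates to seeds in decreasing order of followee locality.

    seeds      -- iterable of initial seed users
    candidates -- iterable of candidate users (the set C)
    followees  -- dict: candidate -> set of users it follows (assumed format)
    tau        -- stop once the seed set holds this many users
    Returns the candidates in the order they were promoted.
    """
    seeds = set(seeds)
    remaining = set(candidates)

    # Number of each candidate's followees that are currently seeds.
    hits = {u: len(followees[u] & seeds) for u in remaining}

    # Reverse index: candidates that follow a given user.
    followed_by = {}
    for u in remaining:
        for v in followees[u]:
            followed_by.setdefault(v, set()).add(u)

    def key(u):
        return hits[u] / len(followees[u]) if followees[u] else 0.0

    # heapq is a min-heap, so keys are negated to pop the maximum first.
    heap = [(-key(u), u) for u in remaining]
    heapq.heapify(heap)

    promoted = []
    while heap and len(seeds) < tau:
        neg_key, u_star = heapq.heappop(heap)
        if u_star not in remaining or -neg_key != key(u_star):
            continue  # stale entry left behind by an earlier key increase
        remaining.discard(u_star)
        seeds.add(u_star)
        promoted.append(u_star)
        # Only candidates that follow u_star gain followee locality.
        for u in followed_by.get(u_star, ()):
            if u in remaining:
                hits[u] += 1
                heapq.heappush(heap, (-key(u), u))
    return promoted
```

Storing the seed-hit count per candidate, rather than the locality itself, keeps each update an integer increment, so a popped entry can be recognized as stale by comparing it with the freshly computed key.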

We now analyze the complexity of Alg. 2. In Lines 4-5,
we build a max-priority queue Q based on each candidate’s
locality value, of which the complexity is $O(|C| \log |C|)$. The
loop beginning from Line 6 finds the target users one at
a time. In each iteration, we extract the maximum value from
the priority queue Q in Line 7, set it as a new seed in Line 8,
and update the locality value of all its followers in Lines 9-10.
The complexity of Lines 6-10 is $O(\tau_A d \log |C|)$, where d is the
average degree in A. Hence the overall complexity of Alg. 2 is
$O((|C| + \tau_A d) \log |C|)$.

One may wonder why we do not add more candidates to C 
once a candidate is added as a new seed to S. We have shown 
in Section 4.2.2 that the candidate users discovered through 
the initial seed set S cover the majority of users in A with 
overwhelming probability. It is thus unlikely that we can iden¬ 
tify more candidate users from newly identified seeds, which 
has been validated by our simulations in Section 5.3. We thus
choose not to add more candidates in each iteration. 


4.4 Cost Analysis 

We now analyze the cost of LocInfer, which consists of the
crawling cost and computation cost, and briefly compare it with
the existing methods.

We first analyze the crawling cost of LocInfer, which is
important given the tight rate limitations Twitter enforces on data
crawling. First, Step 1 in LocInfer involves invoking the
Twitter geo-search API continuously to obtain the initial seed set
S and needs to crawl some geo-tagged users’ tweets. Second,
Step 2 requires crawling the followees and followers of each
seed user in S. Finally, Step 3 needs to crawl the followees,
followers, and initiators of each candidate user in C. Recall
that d denotes the average number of followers and followees
each seed user has. Our datasets show that d is
approximately 600. It has also been reported that 15.1% of the U.S.
population uses Twitter and that 15.9% of Twitter users report
city-level locations and become seeds in LocInfer. In LocInfer,
a user is chosen as a candidate if he has t followers and t
followees in S. So we can expect that the candidate set size
$|C|$ is much smaller than $d|S|$, i.e., 14.4 times the population
in the target area A. In contrast, all previous (potential)
solutions involve crawling all the Twitter users. Thus LocInfer
has a much smaller crawling cost, which makes it practical.
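
The factor of 14.4 follows from multiplying the three quantities quoted above. The snippet below spells out that arithmetic; the function name and parameter names are hypothetical, and the default values are simply the figures given in the text.

```python
def candidate_bound_ratio(avg_degree=600, twitter_share=0.151, seed_share=0.159):
    """Upper bound d * |S| on |C|, expressed as a multiple of the population.

    |S| is approximated as seed_share * twitter_share * population, so the
    population cancels out of the ratio.
    """
    return avg_degree * twitter_share * seed_share


if __name__ == "__main__":
    ratio = candidate_bound_ratio()
    print(f"d|S| is about {ratio:.1f} times the population of A")
```

Because the bound scales with the population of the target area rather than with the size of the whole network, the crawling cost stays local to A.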

The computation cost of LocInfer is dominated by the third
step, with the complexity of Alg. 2 being $O((|C| + \tau_A d) \log |C|)$,
where d is the average number of neighbors of each user and
$\tau_A$ is the number of target users in A.

4.5 Countermeasure 

LocInfer aims to discover the majority of users in any target
area even if many of them do not disclose their locations
explicitly in their personal profiles. We propose a simple
countermeasure here to alleviate the possible concerns of some
sensitive users about their location privacy. Since LocInfer
discovers a user’s location based on his tight connections with
other users in the same area, the user can effectively hide his
home location by following, retweeting, mentioning, and
replying to Twitter users outside his home area on a regular basis.
This strategy is meaningful because people can follow or interact
with others who are in different areas but share the same
interests. For example, a user in New York City and another
in Los Angeles may interact on Twitter because they were
university classmates in Dallas or met at a concert.
The efficacy of this countermeasure is evaluated in Section 5.5.

5. PERFORMANCE EVALUATION 

In this section, we thoroughly evaluate LocInfer. As stated
before, this paper targets a different problem from existing work,
and hence we do not compare LocInfer with existing methods
head to head, but could incorporate them in our future
work.

5.1 Methodology 

To evaluate LocInfer, we first need to build a testing multigraph
$G = (V, E)$ formed by both users known to be in a target area A
and users known not to be in A, where one challenge is that we
cannot directly determine all the Twitter users in A.

To tackle this challenge, we adopt the method used by existing
work. Specifically, since the self-reported locations
have been found reliable, for each area A in Table 1 we
treat all the seed users discovered in the first step as the
positive ground truth (i.e., they are indeed in A), denote this
set by $\mathcal{S}$, and randomly partition $\mathcal{S}$ into a seed
subset S of size $\alpha|\mathcal{S}|$ and a testing subset
T of size $(1 - \alpha)|\mathcal{S}|$.
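
The random partition of the positive ground truth can be sketched as below. The function name, the fixed default random seed, and the rounding of the seed-subset size to the nearest integer are assumptions of this sketch; with that rounding, the sizes for the smallest dataset match the corresponding row of Table 4.

```python
import random


def split_ground_truth(users, alpha=0.159, rng=None):
    """Randomly split positive ground-truth users into seed and testing subsets."""
    rng = rng or random.Random(0)  # fixed seed only for reproducibility here
    users = list(users)
    rng.shuffle(users)
    k = round(alpha * len(users))  # size of the seed subset
    return set(users[:k]), set(users[k:])


if __name__ == "__main__":
    seed_subset, testing_subset = split_ground_truth(range(28161))
    print(len(seed_subset), len(testing_subset))
```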

For the negative ground truth, we check the followers and
followees of the positive ground-truth users, record the set of
users who have specified a location outside A, and randomly choose
a $\beta$ fraction of these users, where $\beta$ is set as the ratio
of seed users over the estimated number of Twitter users in A, as
shown in the fourth column of Table 1. We denote by O the resulting
user set and let $V = \mathcal{S} \cup O$, where $\mathcal{S}$ denotes
the positive ground-truth set. We finally compute edges among all the
users in V according to their followings and interactions by analyzing
their followers, followees, and the latest 600 tweets.

We then apply LocInfer to the testing multigraph G. Specifically,
we first use S as the seed set and apply Alg. 1 to generate
the candidate set C. We then apply Alg. 2 to G to generate the
target user set U by choosing a $\tau_A$. Following the definitions
given earlier, the coverage can be computed as
$|U \cap \mathcal{S}|/|\mathcal{S}|$, and the accuracy can be computed
as $|U \cap \mathcal{S}|/|U|$ ($|U| = \tau_A$), where $\mathcal{S}$
denotes the positive ground-truth set.
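
Both metrics are simple set ratios. The following sketch uses Python sets, with `found` standing for the discovered user set U and `positives` for the positive ground-truth set; both names are chosen here for illustration.

```python
def coverage(found, positives):
    """Fraction of positive ground-truth users that were discovered."""
    return len(found & positives) / len(positives)


def accuracy(found, positives):
    """Fraction of discovered users that are positive ground truth."""
    return len(found & positives) / len(found)


if __name__ == "__main__":
    positives = {1, 2, 3, 4, 5}
    found = {1, 2, 3, 9}
    print(coverage(found, positives), accuracy(found, positives))
```

The two ratios share a numerator and differ only in the denominator, which is why enlarging the discovered set can raise coverage while lowering accuracy.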

Unless stated otherwise, we choose $t = 2$ when building the
candidate set C with Alg. 1 for LA and $t = 1$ for the other
three datasets, and set $\alpha = 0.159$, the average ratio for the four
datasets in Table 1. The testing multigraphs are summarized in
Table 4.


Table 4: The testing multigraphs for the evaluation ($\alpha = 0.159$).

A     $|\mathcal{S}|$    $|S|$     $|T|$      $\beta$    $|O|$
TS    28,161     4,478    23,683    18.65%    162,446
PI    144,033    22,901   121,132   15.9%     630,321
CI    318,632    50,662   267,970   22.21%    1,529,431
LA    300,148    47,724   252,424   12.12%    710,085


5.2 Accuracy 

We first evaluate the accuracy of LocInfer. We compute five
locality values for each user, including follower locality,
followee locality, initiator locality, and the two locality values
defined in Eq. (4) and Eq. (5), respectively.

Fig. 2 shows the accuracy of LocInfer for the four datasets,
where $\alpha = |S|/|\mathcal{S}| = 0.159$ is the fraction of positive
ground-truth users used as seeds and $\tau_A = |\mathcal{S}|$. We can see
that the five locality metrics all lead to high accuracy in each
area, and initiator locality has the worst performance among
them. Specifically, the average accuracy over the four datasets for
each locality is 73.2%, 72.6%, 62.3%, 72.4%, and 71.9%,
respectively. The reason is that initiator locality depends on
interacting edges (corresponding to replies, mentions, and retweets),
which are much sparser than following edges in the directed
Twitter multigraph, as shown in Fig. 1. Therefore, if many users
in A follow many people but do not interact with them
subsequently, they may be reachable from seed users through
following edges but not through interacting edges. We will show
the coverage for different locality metrics in Section 5.3.
Moreover, the locality values defined in Eq. (4) and Eq. (5)
have nearly the same accuracy as both the follower and
followee locality. This is expected because about 96.2% of the
seed set’s initiator neighbors are from their followers or
followees, as indicated earlier.

Figure 2: The accuracy of LocInfer. (x-axis: Location, for LA, PI, TS, and CI.)

Figure 3: Detailed accuracy illustration.

Figure 4: The impact of $\alpha$. (a) TS. (b) PI. (Curves: Followers, Followees, Initiators, Combined 1, Combined 2.)

To shed more light on the accuracy of LocInfer, we set $\tau_A = |C|$
so that $U = C \cup S$ when Alg. 2 terminates, i.e., every
candidate user is eventually added into S. Let $U'$ denote the
newly discovered users (who may not be in A), i.e., $U' = C$. We
partition $U'$ into 100 bins of equal size $|U'|/100$ according to
the order they are added, where the bins with smaller indexes
contain the users discovered earlier. Let $x_i$ denote the number
of positive ground-truth users in the i-th bin. Fig. 3 shows the
accuracy of the i-th bin, which is defined as the ratio of the
number of positive ground-truth users in the i-th bin to the
number of users in each bin and is computed as $100 x_i/|U'|$.
We can see that the accuracy in each bin decreases as the bin
index increases, which is expected, as the later the users are
added to $U'$, the less likely they are indeed located in A.
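
The per-bin accuracy can be computed as in the following sketch. The function name is hypothetical, and the sketch assumes that the number of users is at least the number of bins; any users left over when the length is not a multiple of the bin count are ignored, which is a simplification made here.

```python
def bin_accuracy(ordered_users, positives, n_bins=100):
    """Accuracy of each bin, for users listed in the order they were added."""
    size = len(ordered_users) // n_bins
    if size == 0:
        raise ValueError("need at least n_bins users")
    result = []
    for i in range(n_bins):
        chunk = ordered_users[i * size:(i + 1) * size]
        x_i = sum(1 for u in chunk if u in positives)  # positives in bin i
        result.append(x_i / size)
    return result


if __name__ == "__main__":
    order = list(range(8))
    print(bin_accuracy(order, positives={0, 1, 2, 4}, n_bins=4))
```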

Fig. 4 shows the impact of $\alpha = |S|/|\mathcal{S}|$, the fraction
of positive ground-truth users used as seeds, on the accuracy
of LocInfer. As expected, the accuracy under all locality metrics
increases as $\alpha$ increases. The reason is that the larger
$\alpha$ is, the more seeds there are, and the easier it is to
discover the target users in A. The downside is that more seeds lead
to a larger candidate set and thus higher crawling and computational
cost, as Alg. 1 needs to check all the neighbors of the seeds.


Figure 5: The impact of t. 



Figure 6: The tradeoff between the coverage and accuracy. 
The solid and dash curves are the coverage and accuracy; 
the marks o, A, o, x represent TS, PI, Cl, and LA, respec¬ 
tively. 

Fig. 5 shows the impact of t on the accuracy by varying t
from one to six. Specifically, Fig. 5(a) shows the accuracy for
four areas using the followee locality, while Fig. 5(b) shows the
accuracy for different locality metrics by using the PI dataset.
Both figures show that the accuracy decreases as t increases. This
is expected because increasing t shrinks the candidate set, which
then misses more target users who have no chance to appear in it.
However, there is a tradeoff between the accuracy and the cost,
because a smaller candidate set also brings lower crawling
and computational cost.

5.3 Coverage 

Fig. 6 shows the coverage of LocInfer when $\alpha = 0.159$ with
the desired number of target users (i.e., $\tau_A = |U|$) varying
from zero to the whole candidate set size $|C|$. We use both the
followee and follower locality in this experiment. As expected,
the larger $\tau_A$ is, the more users in T are contained in U and the
higher the coverage, and vice versa. When we set $\tau_A = |C|$, the
average coverage of these four locations by using followee, follower,
and initiator locality is equal to 86.3%, 86.6%, and 79.7%,
respectively. As stated, since the interacting edges (corresponding
to replies, mentions, and retweets) are much sparser than
following edges in the directed Twitter multigraph as shown
in Fig. 1, the initiator locality has less coverage than the
followee and follower locality. Moreover, the average coverage
by using the follower or followee locality in Fig. 6 is
consistent with Corollary 1. Specifically, the average number of
mutual followers $d_m$ for the four datasets is 7.8, 9.0, 11.6, and
11.6, respectively. According to Corollary 1, when $\alpha = 0.159$,
the expected coverage for $t = 1$ is above 82.3%, which coincides
with our results.

5.4 Accuracy and Coverage Tradeoff 

Fig. 6 also shows the anticipated tradeoff between the coverage
and accuracy. As we can see, the larger $\tau_A$ is, the more
positive ground-truth users will be added to U, resulting in
higher coverage. However, a larger $\tau_A$ will also introduce
negative ground-truth users into U, resulting in lower accuracy.
This tradeoff could guide us in choosing the parameter $\tau_A$. On
the one hand, if one desires higher coverage, a large termination
threshold $\tau_A$ should be used, but it is possible that many
users in U are not actually in A. On the other hand, if one
wants to be certain that the users discovered by LocInfer are
most likely in A, a smaller $\tau_A$ should be used at the cost of
possibly missing some users who are indeed in A.

5.5 Effectiveness of Countermeasure 

To evaluate the efficacy of this countermeasure, we let each
user in the testing set T in each area additionally follow or be
followed by a certain number of users from O who are not in
A, and we refer to those following edges as camouflage edges.
Fig. 7 shows the accuracy under this countermeasure. As
we can see, the accuracy of LocInfer decreases as the number
of camouflage edges increases, highlighting the efficacy of the
countermeasure. Besides adding random following edges, a
user can also retweet, mention, and reply to random users on a
regular basis to counteract LocInfer, which is expected to yield
similar results, as these interactions can also decrease the
geographic locality.
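
The effect of camouflage edges on a single user's followee locality can be illustrated with a one-line calculation. The numbers in the example are made up for illustration and are not taken from the experiments; the function name is likewise hypothetical.

```python
def followee_locality(local_followees, total_followees, camouflage=0):
    """Followee locality after adding `camouflage` followees outside the area."""
    return local_followees / (total_followees + camouflage)


if __name__ == "__main__":
    # A user with 30 of 50 followees among the seeds, before and after
    # following 50 additional users outside the target area.
    print(followee_locality(30, 50), followee_locality(30, 50, camouflage=50))
```

Each camouflage edge enlarges the denominator without touching the numerator, so the user sinks in the ranking that LocInfer uses to pick new seeds.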



Figure 7: Countermeasure efficacy. (a) Followees. (b) Followers.


6. RELATED WORK 


In this section, we briefly present the existing work mostly 
related to this paper. 

Inferring a Twitter user’s hidden location has been widely
studied in the community, and existing methods can be categorized
as content-based and network-based. Content-based methods
try to infer a user’s location from his tweets. For example,
Cheng et al. proposed a probabilistic framework to
estimate a Twitter user’s location based on his tweets,
placing 51% of Twitter users within 100 miles of their
home locations. Mahmud et al. further improved this result
to 64% for city-level location inference. Hecht et al.
thoroughly studied the location profiles of Twitter users and
found that 34% of the users either left them empty or entered
non-geographic information. They also inferred users’ country
and state information by checking their tweets. Network-based
methods try to estimate a Twitter user’s location from his
neighbors. Jurgens aimed to infer all the users’ locations
by building a global network and then propagating location
assignments from several seeds. Yamaguchi et al. [10] built
several distributed landmarks and then inferred a user’s
location based on the connections with them. Compton et al.
inferred the locations of all the users in Twitter by minimizing
their distances to the labelled users. Moreover, Li et
al. combined the content and network information to obtain
a more accurate estimation. All these schemes seek to
address the same question: how can we infer a user’s hidden
location from all his location-related tweets and/or neighbors’
locations? This paper targets a different problem: could we
discover all or the majority of Twitter users in a metropolitan area?
Directly adopting these existing methods to address our problem
will result in scanning the whole Twitter network. Moreover,
the accuracy of LocInfer outperforms the state of the art.

This paper is also related to privacy disclosure and protection
in OSNs in general. Li et al. used the neighbors’
locations to infer the location in the emerging location-based
social networks. Sun et al. protected the location privacy
on the social crowdsourcing networks. Mao et al. [22] used
tweets to detect Twitter users’ situational leaks such as
vacation status, drunk status, and medical conditions. Dey et
al. [23] leveraged the information from neighbors to estimate
the age of Facebook users. Mislove et al. [24] also used the local
connections around the Facebook users to infer their hidden
attributes such as major, college, and political view. Our paper
is complementary to these works and also highlights that current
OSNs have emerged as an arguable threat to users’ privacy.


7. CONCLUSION AND FUTURE WORK 

This paper presented LocInfer, a novel system that is able
to discover the majority of Twitter users in any geographic
area. Detailed experiments confirmed the high efficacy and
efficiency of LocInfer. We also proposed a countermeasure to
hide the locations of sensitive users from LocInfer and evaluated
its efficacy with experiments driven by real datasets.

There are some open issues to study in our future work.
First, when constructing a reliable seed set S at the first step,
we assumed the credibility of the self-reported locations and
used a heuristic method to refine the seed set. In future
work, more advanced methods can be used to refine the seed
set, and it is also interesting to check the impact of the seed set
credibility on the ultimate performance. Second, the accuracy
of LocInfer can be further improved by incorporating the
existing content-based methods and other signals such as
timezone and language.